In [1]:
In [2]:
G:\KaggleProject\MercariPriceSuggestionChallenge\code\feature_enginnering

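Overview of the dataset

  • Size of the dataset
  • Missing values in the dataset
  • Data type of each column
  • Summary statistics (describe) for the numeric columns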
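
The four checks listed above map onto a handful of pandas calls. The sketch below is illustrative only: the tiny `train_df` is a made-up stand-in, since the code that loads the real Mercari files is not shown in this export.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical rows, not the real file).
train_df = pd.DataFrame({
    "train_id": [0, 1, 2],
    "name": ["shirt", "keyboard", "blouse"],
    "brand_name": [None, "Razer", "Target"],
    "price": [10.0, 52.0, 10.0],
})

shape = train_df.shape               # (rows, columns)
has_null = train_df.isnull().any()   # per column: does it contain any missing value?
summary = train_df.describe()        # statistics for the numeric columns only

print("train shape:", shape)
print("train is null:")
print(has_null)
train_df.info()                      # dtypes and non-null counts per column
print(summary)
```

`isnull().any()` only reports whether a column has missing values; `isnull().sum()` would give the count per column.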
In [3]:
train shape: (1482535, 8)

test shape: (693359, 7)

train is null: 
train_id             False
name                 False
item_condition_id    False
category_name         True
brand_name            True
price                False
shipping             False
item_description      True
dtype: bool

test is null: 
test_id              False
name                 False
item_condition_id    False
category_name         True
brand_name            True
shipping             False
item_description     False
dtype: bool

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
train_id             1482535 non-null int64
name                 1482535 non-null object
item_condition_id    1482535 non-null int64
category_name        1476208 non-null object
brand_name           849853 non-null object
price                1482535 non-null float64
shipping             1482535 non-null int64
item_description     1482531 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB
Out[3]:
train_id item_condition_id price shipping
count 1.482535e+06 1.482535e+06 1.482535e+06 1.482535e+06
mean 7.412670e+05 1.907380e+00 2.673752e+01 4.472744e-01
std 4.279711e+05 9.031586e-01 3.858607e+01 4.972124e-01
min 0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00
25% 3.706335e+05 1.000000e+00 1.000000e+01 0.000000e+00
50% 7.412670e+05 2.000000e+00 1.700000e+01 0.000000e+00
75% 1.111900e+06 3.000000e+00 2.900000e+01 1.000000e+00
max 1.482534e+06 5.000000e+00 2.009000e+03 1.000000e+00
In [4]:
Out[4]:
train_id name item_condition_id category_name brand_name price shipping item_description
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ...
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol...
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]...
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity

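Next, each column is analyzed in turn for its distribution and characteristics.

- Target column: price

Examine the distribution of this column, then apply a log transform to it.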
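
A minimal sketch of the log transform follows. The export does not show which variant the notebook used; `np.log1p` is an assumption here, chosen because the summary statistics report a minimum price of 0, where a plain logarithm is undefined. The five prices are the min, quartiles, and max from that summary, used purely as sample inputs.

```python
import numpy as np
import pandas as pd

# Sample prices taken from the describe() output (min, 25%, 50%, 75%, max).
train_df = pd.DataFrame({"price": [0.0, 10.0, 17.0, 29.0, 2009.0]})

# log1p computes log(1 + x), so a price of 0 maps to 0 instead of -inf.
train_df["log_price"] = np.log1p(train_df["price"])

print(train_df["price"].describe())
print(train_df["log_price"].describe())
```

`np.expm1` inverts the transform when predictions need to be mapped back to prices.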

In [5]:

- Shipping

Compare the number of listings with and without free shipping, and the price distribution of each group.
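
One way to get both numbers is sketched below, assuming a `train_df` with `shipping` and `price` columns; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real notebook works on the full training set.
train_df = pd.DataFrame({
    "shipping": [0, 0, 1, 1, 0],
    "price": [12.0, 30.0, 8.0, 15.0, 45.0],
})

# Fraction of listings per shipping flag (the fractions sum to 1).
shipping_share = train_df["shipping"].value_counts(normalize=True)

# Price summary within each shipping group.
price_by_shipping = train_df.groupby("shipping")["price"].describe()

print(shipping_share)
print(price_by_shipping)
```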

In [6]:
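Share of listings where the seller pays shipping (shipping = 1) versus where the seller does not (shipping = 0), and the price range each group covers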
0    0.552726
1    0.447274
Name: shipping, dtype: float64

- category_name

Count the distinct values of category_name, check for null values, and split each category into sub-categories.
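
The counting part can be sketched with three pandas calls. The sample rows are hypothetical; only the column name `category_name` comes from the dataset.

```python
import pandas as pd

# Hypothetical sample; one row has no category.
train_df = pd.DataFrame({
    "category_name": [
        "Men/Tops/T-shirts",
        "Women/Jewelry/Necklaces",
        "Women/Jewelry/Necklaces",
        None,
    ]
})

n_unique = train_df["category_name"].nunique()           # missing values are not counted
top = train_df["category_name"].value_counts().head(5)   # most frequent first
n_missing = train_df["category_name"].isnull().sum()

print("There are %d unique values in the category column." % n_unique)
print(top)
print("There are %d items that do not have a label." % n_missing)
```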

In [7]:
There are 1287 unique values in the category column.

Women/Athletic Apparel/Pants, Tights, Leggings    60177
Women/Tops & Blouses/T-Shirts                     46380
Beauty/Makeup/Face                                34335
Beauty/Makeup/Lips                                29910
Electronics/Video Games & Consoles/Games          26557
Name: category_name, dtype: int64

There are 6327 items that do not have a label.

Split category_name into a main category, sub-category 1, and sub-category 2.
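
A sketch of the split, under two assumptions that this export does not confirm: missing or malformed categories fall back to "No Label" (the label that appears in the general_cat counts), and any slashes beyond the second stay inside the last level. `split_cat` is a hypothetical helper name.

```python
import pandas as pd

NO_LABEL = "No Label"

def split_cat(text):
    """Split 'a/b/c' into three levels; fall back to 'No Label' when missing."""
    if isinstance(text, str):
        parts = text.split("/", 2)  # at most 3 pieces; extra slashes stay in the last one
        if len(parts) == 3:
            return tuple(parts)
    return (NO_LABEL, NO_LABEL, NO_LABEL)

# Hypothetical sample rows.
train_df = pd.DataFrame({
    "category_name": ["Men/Tops/T-shirts", "Women/Tops & Blouses/Blouse", None]
})

levels = train_df["category_name"].apply(split_cat)
train_df["general_cat"], train_df["subcat_1"], train_df["subcat_2"] = zip(*levels)
print(train_df)
```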

In [8]:
Out[8]:
train_id name item_condition_id category_name brand_name price shipping item_description general_cat subcat_1 subcat_2
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet Men Tops T-shirts
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ... Electronics Computers & Tablets Components & Parts
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol... Women Tops & Blouses Blouse
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]... Home Home Décor Home Décor Accents
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity Women Jewelry Necklaces
In [9]:
There are 11 unique general_cat.
There are 114 unique first sub-categories.
There are 871 unique second sub-categories.
Out[9]:
Women                     664385
Beauty                    207828
Kids                      171689
Electronics               122690
Men                        93680
Home                       67871
Vintage & Collectibles     46530
Other                      45351
Handmade                   30842
Sports & Outdoors          25342
No Label                    6327
Name: general_cat, dtype: int64
In [10]:
[Figure: Number of Items by Main Category. Bar chart; x-axis: Category, y-axis: Count.]
In [11]:
[Figure: Number of Items by Sub Category (Top 15). Bar chart; x-axis: SubCategory, y-axis: Count. Sub-categories in plotted order: Athletic Apparel, Makeup, Tops & Blouses, Shoes, Jewelry, Toys, Cell Phones & Accessories, Women's Handbags, Dresses, Women's Accessories, Jeans, Video Games & Consoles, Sweaters, Underwear, Skin Care.]
In [12]:
[Figure: Price Distribution by General Category. Axes: Category, Frequency. Categories in plotted order: Men, Electronics, Women, Home, Sports & Outdoors, Vintage & Collectibles, Beauty, Other, Kids, No Label, Handmade.]

- brand_name

Count the distinct brands and the number of items per brand.
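
A minimal sketch of the brand counts, using invented rows. Missing brands are left out of both results by default.

```python
import pandas as pd

# Hypothetical sample; None stands for a listing without a brand.
train_df = pd.DataFrame({
    "brand_name": ["Nike", "PINK", "Nike", None, "Apple", "Nike", "PINK"]
})

n_brands = train_df["brand_name"].nunique()                 # distinct non-missing brands
top_brands = train_df["brand_name"].value_counts().head(10) # items per brand, largest first

print("There are %d unique brand names in the training dataset." % n_brands)
print(top_brands)
```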

In [13]:
There are 4809 unique brand names in the training dataset.
[Figure: Top 10 Brand by Number of Items. Bar chart; axes: Brand Name, Count. Brands in plotted order: PINK, Nike, Victoria's Secret, LuLaRoe, Apple, FOREVER 21, Nintendo, Lululemon, Michael Kors, American Eagle.]

- item_description

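item_description is unstructured text, which makes it harder to analyze. As a first step, check whether description length correlates with price. Here, length means the number of words longer than 3 characters that remain after removing punctuation and English stop words.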
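
The length definition above can be sketched as a small function. The stop-word list here is a short illustrative subset (an assumption); the notebook's actual list and tokenizer are not shown in this export, and `desc_len` as a function name is hypothetical.

```python
import re
import string

# Short illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "the", "this", "that", "is", "in", "of",
    "with", "for", "no", "not", "yet", "it", "to",
}

# Matches any single ASCII punctuation character.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")

def desc_len(text):
    """Count words longer than 3 characters after dropping punctuation and stop words."""
    if not isinstance(text, str):
        return 0  # missing descriptions count as length 0
    cleaned = PUNCT_RE.sub(" ", text.lower())
    words = [w for w in cleaned.split() if w not in STOP_WORDS and len(w) > 3]
    return len(words)

print(desc_len("No description yet"))                         # 1
print(desc_len("Complete with certificate of authenticity"))  # 3
```

With a DataFrame, the same function would be applied per row, for example `train_df["item_description"].apply(desc_len)`.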

In [14]:
Out[14]:
train_id name item_condition_id category_name brand_name price shipping item_description general_cat subcat_1 subcat_2 desc_len
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet Men Tops T-shirts 1
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ... Electronics Computers & Tablets Components & Parts 14
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol... Women Tops & Blouses Blouse 8
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]... Home Home Décor Home Décor Accents 14
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity Women Jewelry Necklaces 3
In [15]:
[Figure: Average Log(Price) by Description Length. x-axis: Description Length, y-axis: Average Log(Price).]
In [16]:
train_df.item_description.isnull().sum() = 4

Next, build a separate vocabulary for each main category.

In [17]:

Count word frequencies for each category.
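
One way to keep a vocabulary per main category and count word frequencies is a dictionary of `Counter` objects. This is a sketch under assumptions: the tokenizer, the stop-word list, and the sample rows are all invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (general_cat, item_description) pairs.
rows = [
    ("Women", "New pink dress, never worn"),
    ("Women", "Pink top in great condition"),
    ("Electronics", "Keyboard works great, free shipping"),
]

TOKEN_RE = re.compile(r"[a-z]+")
STOP_WORDS = {"in", "the", "a", "and"}  # illustrative subset only

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words and very short words."""
    return [w for w in TOKEN_RE.findall(text.lower())
            if w not in STOP_WORDS and len(w) > 2]

# One Counter per main category acts as that category's vocabulary with counts.
word_counts = defaultdict(Counter)
for cat, desc in rows:
    word_counts[cat].update(tokenize(desc))

# Summing the per-category Counters gives the overall word frequencies.
overall = sum(word_counts.values(), Counter())

print(word_counts["Women"].most_common(3))
print(overall.most_common(5))
```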

In [21]:
[Figure: Word Frequency. Bar chart; x-axis: Word, y-axis: Count. Words in plotted order: new, size, brand, free, condition, shipping, worn, used, never, great, black, price, color, pink, one, bundle, small, like, please, good.]